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ABSTRACT 


In this paper we compare different kernels that have been developed for support
vector machine (SVM) based time series classification. Despite the strong
performance of SVM on many practical classification problems, the algorithm is
not directly applicable to multi-dimensional trajectories of different lengths.
Training SVM with indefinite kernels has recently attracted attention in the
machine learning community. This is partly because many similarity functions
that arise in practice are not symmetric positive semidefinite. In this paper we
extend the Gaussian RBF kernel to the Gaussian elastic metric kernel. The
extended kernel comes in two forms, one based on the time warp edit distance and
one based on the edit distance with real penalty. Experimental results on 17
time series data sets show that, in terms of classification accuracy, SVM with
the Gaussian elastic metric kernel is much superior to other kernels and to
state-of-the-art similarity measure methods. We use the indefinite similarity
function or distance directly without any conversion, and hence training and
test examples are always treated consistently. Finally, the Gaussian elastic
metric kernel achieves the highest accuracy among all methods that train SVM
with kernels, both positive semidefinite (PSD) and non-PSD, with statistical
significance, while also retaining sparsity of the support vector set.
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1. INTRODUCTION 

Kernel algorithms are motivated by two observations. Firstly, linearity
is rather special, and outside mathematics no model of a real system is
truly linear. Secondly, detecting linear relations has been the focus of
much research in statistics, soft computing and machine vision for
decades, and the resulting algorithms are well understood, well developed
and efficient. Naturally, one wants the best of both worlds. So, if a
problem is non-linear, instead of trying to fit a non-linear model, one
can map the problem from the input space to a new (higher-dimensional)
space, called the feature space, by applying a non-linear transformation
with suitably chosen basis functions, and then use a linear model in the
feature space. This is known as the 'kernel trick'. The linear model in
the feature space corresponds to a non-linear model in the input space.
This approach can be used in both classification and regression problems.
The choice of kernel function is crucial for the success of all kernel
algorithms because the kernel encodes the prior knowledge that is
available about a task. Accordingly, there is no free lunch in kernel
choice.
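
The correspondence between a kernel evaluated in the input space and an inner product in the feature space can be illustrated with a minimal sketch. The degree-2 polynomial kernel and the names `phi`, `dot` and `poly2_kernel` are assumptions chosen for this example only; they are not the kernels studied in this paper.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))  # inner product computed in the feature space
implicit = poly2_kernel(x, y)   # same quantity computed in the input space
print(explicit, implicit)
```

Both quantities agree up to rounding, so a linear method that only needs inner products can work with `poly2_kernel` and never build `phi` explicitly.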

According to Martin Sewell (2007), the term kernel derives from a word
that can be traced back to c. 1000 and originally meant a seed (contained
within a fruit) or the softer (usually edible) part contained within the
hard shell of a nut or stone-fruit. The former meaning is now superseded.
It was first used in mathematics when it was defined for integral
equations in which the kernel is known and the other function(s) unknown,
but it now has several meanings in mathematics. The machine learning term
kernel trick was first used in 1998.

In linear algebra we know that any symmetric matrix $K$ with real-valued
entries can be written in the form $K = PDP^{T}$, where
$P = [V_1, \ldots, V_m]$, the $V_i$ are eigenvectors of $K$ that form an
orthonormal basis (so we also have $P^{T} = P^{-1}$), and $D$ is a
diagonal matrix with $D_{i,i} = \lambda_i$ being the corresponding
eigenvalues. A square matrix $A$ is positive semi-definite (PSD) iff for
all vectors $c$ we have

$$c^{T} A c = \sum_{i} \sum_{j} c_i c_j A_{i,j} \ge 0 .$$

It is well known that a matrix is positive semi-definite iff all its
eigenvalues are non-negative.
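
As a small worked illustration of the eigenvalue criterion, consider a symmetric matrix that is not PSD. The matrix and the vector $c$ below are chosen only for this example and do not come from the experiments in this paper.

```latex
K = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}, \qquad
\lambda_1 = 3, \quad \lambda_2 = -1, \qquad
c = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\;\Longrightarrow\;
c^{T} K c = 1 - 2 - 2 + 1 = -2 < 0 .
```

The negative eigenvalue and the negative quadratic form tell the same story: this $K$ is indefinite, so it cannot be the Gram matrix of any kernel.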

In this paper we check the symmetric positive semidefinite condition with
the help of Mercer's Theorem.

The sample $S = \{x_1, \ldots, x_m\}$ includes $m$ examples. The kernel
(Gram) matrix $K$ is an $m \times m$ matrix including inner products
between all pairs of examples, i.e., $K_{i,j} = k(x_i, x_j)$. $K$ is
symmetric since $k(x_i, x_j) = k(x_j, x_i)$.

Mercer's Theorem:

A symmetric function $k(\cdot,\cdot)$ is a kernel iff for any finite
sample $S$ the kernel matrix for $S$ is positive semi-definite.

One direction of the theorem is easy: if $k(\cdot,\cdot)$ is a kernel with
feature map $\phi$, and $K$ is the kernel matrix with
$K_{i,j} = k(x_i, x_j)$, then

$$c^{T} K c = \sum_{i} \sum_{j} c_i c_j K_{i,j}
= \sum_{i} \sum_{j} c_i c_j \, \phi(x_i) \cdot \phi(x_j)
= \Big( \sum_{i} c_i \phi(x_i) \Big) \cdot \Big( \sum_{j} c_j \phi(x_j) \Big)
= \Big\| \sum_{i} c_i \phi(x_i) \Big\|^{2} \ge 0 .$$

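For the other direction we will prove a weaker result.

Theorem:

Consider a finite input space $X = \{x_1, \ldots, x_m\}$ and the kernel
matrix $K$ over the entire space. If $K$ is positive semi-definite then
$k(\cdot,\cdot)$ is a kernel function.

Proof: By the linear algebra facts above we can write $K = PDP^{T}$.

Define a feature mapping into an $m$-dimensional space where the $l$-th
entry in the feature expansion for example $x_i$ is
$\phi_l(x_i) = \sqrt{\lambda_l}\, P_{i,l}$ (well defined because $K$ being
PSD implies $\lambda_l \ge 0$). The inner product is

$$\phi(x_i) \cdot \phi(x_j) = \sum_{l=1}^{m} \phi_l(x_i) \phi_l(x_j)
= \sum_{l=1}^{m} \lambda_l P_{i,l} P_{j,l} .$$

We want to show that

$$K_{i,j} = \phi(x_i) \cdot \phi(x_j) .$$

Consider the $i,j$ entry of the matrix $K = PDP^{T}$. We have the
following identities, where the last one proves the result:

$$K_{i,j} = [PDP^{T}]_{i,j} = \sum_{l=1}^{m} [PD]_{i,l} \, [P^{T}]_{l,j}
= \sum_{l=1}^{m} [PD]_{i,l} \, P_{j,l} ,$$

$$[PD]_{i,l} = \lambda_l P_{i,l} ,$$

$$K_{i,j} = \sum_{l=1}^{m} \lambda_l P_{i,l} P_{j,l} .$$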

Note that Mercer's theorem allows us to work with a kernel function
without knowing which feature map it corresponds to or its relevance to
the learning problem. This has often been used in practical applications.

In real-life applications, however, many similarity functions exist that
are either indefinite or for which the Mercer condition is difficult to
verify. For example, one can incorporate the longest common subsequence in
defining distance between genetic sequences, use the BLAST similarity
score between protein sequences, use set operations such as
union/intersection in defining similarity between transactions, use
human-judged similarities between concepts and words, use the symmetrized
Kullback-Leibler divergence between probability distributions, use dynamic
time warping for time series, or use the tangent distance and shape
matching distance in computer vision [1,2,3,4]. Extending SVM to
indefinite kernels will greatly expand its applicability. Recent work on
training SVM with indefinite kernels has generally fallen into three
categories: positive semidefinite (PSD) kernel approximation, non-convex
optimization (NCO) and learning in Krein spaces (LKS). In the first
approach, the kernel matrix of training examples is altered so that it
becomes PSD. The motivation behind such an approach is the assumption that
negative eigenvalues are caused by noise [5,6]. One such approach was
introduced by Luss and d'Aspremont in 2007, with improvements in training
time reported subsequently [7,8,9]. All the kernel approximation methods
above guarantee that the optimization problem remains convex during
training. During testing, however, the original indefinite kernel function
is used. Hence, training and test examples are treated inconsistently. In
addition, such methods are only useful when the similarity matrix is
approximable by a PSD matrix. For other similarity functions, such as the
sigmoid kernel, which can occasionally yield a negative semidefinite
matrix for certain values of its hyper-parameters, the kernel
approximation approach cannot be utilized.

In the second approach, non-convex optimization methods are used. SMO-type
decomposition might be used in finding a local minimum with indefinite
similarity functions [10]. Haasdonk interprets this as a method of
minimizing the distance between reduced convex hulls in a pseudo-Euclidean
space [4]. However, because such an approach can terminate at a local
minimum, it does not guarantee learning [1]. Similar to the previous
approach, this method only works well if the similarity matrix is nearly
PSD.

The third approach that has been proposed in the literature is to extend
SVM into Krein spaces, in which a reproducing kernel is decomposed into
the sum of one positive semidefinite kernel and one negative semidefinite
kernel [11,12]. Instead of minimizing regularized risk, the objective
function is now stabilized. One fairly recent algorithm that has been
proposed to solve the stabilization problem is called Eigen-decomposition
SVM (ESVM) [12]. While this algorithm has been shown to outperform all
previous methods, its primary drawback is that it does not produce sparse
solutions, hence the entire list of training examples is often needed
during prediction.


The main contribution of this paper is to establish both theoretically and
experimentally that the 1-norm SVM [13], which was proposed more than 10
years ago, is a better solution for extending SVM to indefinite kernels.
More specifically, 1-norm SVM can be interpreted as a structural risk
minimization method that seeks a decision boundary with large similarity
margin in the original space. It uses a linear programming formulation
that remains convex even if the kernel matrix is indefinite, and hence can
always be solved quite efficiently. It uses the indefinite similarity
function (or distance) directly without any transformation, and, hence, it
always treats both training and test examples consistently. In addition,
it achieves the highest accuracy among all the methods that train SVM with
indefinite kernels, with statistical significance, while also retaining
sparsity of the support vector set. In the literature, 1-norm SVM is often
used as an embedded feature selection method, where learning and feature
selection are performed concurrently [14, 13, 15, 17, 16, 18]. It was
studied in [13], where it was argued that 1-norm SVM has an advantage over
standard 2-norm SVM when there are redundant noise features. To the
knowledge of the authors, the advantage of using 1-norm SVM in handling
indefinite kernels has never been established in the literature.

International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Paper ID - IJTSRD23437 | Volume - 3 | Issue - 3 | Mar-Apr 2019 | Page: 1646


As a state-of-the-art classifier, the support vector machine (SVM) has
also been examined and applied for time series classification in two
modes. On one hand, combined with various feature extraction approaches,
SVM can be adopted as a plug-in method for time series classification
problems. On the other hand, by designing appropriate kernel functions,
SVM can also be applied directly to the original time series data. Because
of the time axis distortion problem, classical kernel functions, such as
Gaussian RBF (GRBF) and polynomial, generally are not suitable for
SVM-based time series classification. Motivated by the success of the
dynamic time warping (DTW) distance, it has been suggested to use elastic
measures to construct appropriate kernels. The Gaussian DTW (GDTW) kernel
was then proposed for SVM-based time series classification [19, 20].
Counter-examples, however, have subsequently been reported showing that
the GDTW kernel usually cannot outperform the GRBF kernel in the SVM
framework. Lei and Sun [21] proved that the GDTW kernel is not a positive
definite symmetric (PDS) kernel acceptable by SVM. Experimental results
[21, 22] also showed that SVM with the GDTW kernel cannot outperform
either SVM with the GRBF kernel or the nearest neighbor classifier with
DTW distance. The poor performance of the GDTW kernel may be attributed to
the fact that DTW is non-metric. Motivated by recent progress in elastic
measures, Zhang et al. proposed a new class of elastic kernels that
extends the GRBF kernel [23]. Kernels have many advantages, and the
following ones motivate the kernel types used in this paper for
classification [24]:

> The kernel defines a similarity measure between two data points and thus
allows one to incorporate prior knowledge of the problem domain.

> Most importantly, the kernel contains all of the information about the
relative positions of the inputs in the feature space, and the actual
learning algorithm is based only on the kernel function and can thus be
carried out without explicit use of the feature space. The training data
only enter the algorithm through their entries in the kernel matrix (a
Gram matrix), and never through their individual attributes. Because one
never explicitly has to evaluate the feature map in the high-dimensional
feature space, the kernel function represents a computational shortcut.

> The number of operations required is not necessarily proportional to the
number of features.

Support vector machines are among the most prevalent classification
algorithms. They are motivated by deep foundations in learning theory,
which make use of the Vapnik-Chervonenkis dimension to establish the
generalization ability of this family of classifiers [25, 26]. However,
SVM has its limitations, which motivated the development of numerous
variants, including the Distance Weighted Discrimination algorithm to deal
with the data piling phenomenon observed in high dimensions [27] and
second-order cone programming techniques for handling uncertain or missing
values, assuming availability of second-order moments of data [28]. One
fundamental limiting factor in SVM is the need for positive semidefinite
kernels.

2. Methods

In standard two-class classification problems, we are given a set of
training data $(x_1, y_1), \ldots, (x_n, y_n)$, where the input
$x_i \in \mathbb{R}^p$ and the output $y_i \in \{1, -1\}$ is binary. We
wish to find a classification rule from the training data, so that when
given a new input $x$, we can assign a class $y$ from $\{1, -1\}$ to it.


To handle this problem, we consider the 1-norm support vector machine:

$$\min_{\beta_0, \beta} \ \sum_{i=1}^{n} \Big[ 1 - y_i \Big( \beta_0 + \sum_{j=1}^{q} \beta_j h_j(x_i) \Big) \Big]_{+} \qquad (1)$$

$$\text{s.t.} \quad \|\beta\|_1 = |\beta_1| + \cdots + |\beta_q| \le s, \qquad (2)$$

where $D = \{h_1(x), \ldots, h_q(x)\}$ is a dictionary of basis functions,
and $s$ is a tuning parameter. The solution is denoted as
$\hat\beta_0(s)$ and $\hat\beta(s)$; the fitted model is

$$\hat f(x) = \hat\beta_0 + \sum_{j=1}^{q} \hat\beta_j h_j(x). \qquad (3)$$

The classification rule is given by $\mathrm{sign}[\hat f(x)]$. The 1-norm
SVM has been successfully used in classification. We argue in this paper
that the 1-norm SVM may have some advantage over the standard 2-norm SVM,
especially when there are redundant noise features. To get a good fitted
model $\hat f(x)$ that performs well on future data, we also need to
select an appropriate tuning parameter $s$. In practice, people usually
pre-specify a finite set of values for $s$ that covers a wide range, then
either use a separate validation data set or use cross-validation to
select a value for $s$ that gives the best performance among the given
set.
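
To make the roles of the objective, the constraint and the fitted model concrete, the sketch below evaluates them for a fixed candidate coefficient vector. It assumes the hinge-loss objective of the 1-norm SVM as formulated in [13]. The helper names, the two basis functions and the toy data are hypothetical, and no optimization is performed here.

```python
def fitted_model(x, beta0, beta, basis):
    """f(x) = beta0 + sum_j beta_j * h_j(x) over a dictionary of basis functions."""
    return beta0 + sum(b * h(x) for b, h in zip(beta, basis))

def hinge_objective(data, beta0, beta, basis):
    """Sum over training pairs of the hinge loss max(0, 1 - y * f(x))."""
    return sum(max(0.0, 1.0 - y * fitted_model(x, beta0, beta, basis))
               for x, y in data)

def l1_norm(beta):
    """1-norm of the coefficient vector, the quantity bounded by s."""
    return sum(abs(b) for b in beta)

def classify(x, beta0, beta, basis):
    """Classification rule: sign of the fitted model (ties go to +1)."""
    return 1 if fitted_model(x, beta0, beta, basis) >= 0 else -1

# Hypothetical toy setup: two coordinate basis functions on 2-D inputs.
basis = [lambda x: x[0], lambda x: x[1]]
data = [((2.0, 0.3), 1), ((-1.5, -0.2), -1), ((0.4, 0.1), 1)]
beta0, beta, s = 0.0, [1.0, 0.0], 1.0  # second coefficient is exactly zero

feasible = l1_norm(beta) <= s
loss = hinge_objective(data, beta0, beta, basis)
print(feasible, loss, classify((0.4, 0.1), beta0, beta, basis))
```

In a full procedure one would solve for the coefficients at each candidate value of `s` and keep the value with the lowest validation error; the zero coefficient above illustrates how the 1-norm constraint can drop a basis function entirely.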

3. Large similarity margins 

Given a similarity function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
between examples $x_i$ and $x_j$, we can define similarity between an
example $x_t$ and a class $y \in \{+1, -1\}$ to be a weighted sum of
similarities with all of its examples. In other words, we may write:

$$s(x_t, y) = \sum_{i:\, y_i = y} \lambda_i \, k(x_t, x_i), \qquad \lambda_i \ge 0 \qquad (4)$$

to denote class similarity between $x_t$ and a class $y$. Here, the weight
$\lambda_i$ represents the importance of the example $x_i$ to its class
$y_i$. In addition, we can introduce an offset $b$ that quantifies prior
preference. Such an offset plays a role that is similar to the prior in
Bayesian methods, the activation threshold in neural networks, and the
offset in SVM. Thus, we consider classification using the rule:

$$\mathrm{sign}\{ s(x_t, +1) - s(x_t, -1) + b \}, \qquad (5)$$

which is identical to the classification rule of 1-norm SVM given by
Eq. (3) when the basis functions are $h_j(x) = k(x, x_j)$ with
$\beta_j = y_j \lambda_j$ and $\beta_0 = b$. Moreover, we define the
similarity margin $m_i$ for example $x_i$ in the usual sense:

$$m_i = s(x_i, y_i) - s(x_i, -y_i) + y_i b \qquad (6)$$
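
The class similarity, the decision rule and the similarity margin can be sketched in a few lines. The similarity function `sim`, the weights and the one-dimensional training set below are hypothetical values picked for illustration; in the method itself the weights and offset would come from solving the optimization problem.

```python
def class_similarity(x, label, train, weights, k):
    """s(x, y): weighted sum of similarities to the training examples of class y."""
    return sum(w * k(x, xi)
               for (xi, yi), w in zip(train, weights) if yi == label)

def decide(x, train, weights, b, k):
    """Decision rule: sign of s(x, +1) - s(x, -1) + b (ties go to +1)."""
    score = (class_similarity(x, +1, train, weights, k)
             - class_similarity(x, -1, train, weights, k) + b)
    return 1 if score >= 0 else -1

def margin(i, train, weights, b, k):
    """Similarity margin of training example i: s(x_i, y_i) - s(x_i, -y_i) + y_i * b."""
    xi, yi = train[i]
    return (class_similarity(xi, yi, train, weights, k)
            - class_similarity(xi, -yi, train, weights, k) + yi * b)

def sim(a, c):
    """Hypothetical similarity: negative absolute difference (not assumed PSD)."""
    return -abs(a - c)

train = [(0.0, -1), (1.0, -1), (4.0, 1), (5.0, 1)]
weights = [0.5, 0.0, 0.5, 0.0]  # zero weight: that example plays no role
b = 0.0

print(decide(4.5, train, weights, b, sim), decide(0.5, train, weights, b, sim))
print(margin(0, train, weights, b, sim))
```

Only the examples with non-zero weight are needed at prediction time, which is the sparsity the 1-norm penalty is meant to encourage.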


Maximizing the minimum similarity margin can be formulated as a linear
program (LP). First, we write:

$$\max_{\lambda,\, b,\, M} \ M$$

subject to

$$s(x_i, y_i) - s(x_i, -y_i) + y_i b \ge M \quad \text{for all } i, \qquad \lambda \ge 0 .$$


However, the decision rule given by Eq. (5) does not change when we
multiply the weights $\lambda$ by any fixed positive constant, including
constants that are arbitrarily large. This is because the decision rule
only looks at the sign of its argument. In particular, we can always
rescale the weights $\lambda$ to be arbitrarily large, for which
$M \to \infty$. This degree of freedom implies that we need to maximize
the ratio $M / \|\lambda\|$ instead of maximizing $M$ in absolute terms.
Here, any norm $\|\cdot\|$ suffices, but the 1-norm is preferred because
it produces sparse solutions and because it gives better accuracy in
practice.

Since our objective is to maximize the ratio $M / \|\lambda\|_1$, we can
fix $M = 1$ and minimize $\|\lambda\|_1$. In addition, to avoid
over-fitting outliers or noisy samples and to be able to handle the case
of non-separable classes, soft-margin constraints are needed as well.
Hence, 1-norm SVM can be interpreted as a method of finding a decision
boundary with a large similarity margin in the original space. Such an
interpretation holds regardless of whether or not the similarity function
is PSD. Thus, we expect 1-norm SVM to work well even for indefinite
kernels.

Similar to the original SVM, one can interpret 1-norm SVM as a method of
striking a balance between estimation bias and variance.

4. Gaussian Elastic Metric Kernel (GEMK)

Before the definition of GEMK, we first introduce the GRBF kernel, one of
the most common kernel functions used in SVM classifiers. Given two time
series $x$ and $y$ with the same length $n$, the GRBF kernel is defined as

$$K_{GRBF}(x, y) = \exp\Big( -\frac{\|x - y\|^{2}}{2\sigma^{2}} \Big),$$

where $\sigma$ is the standard deviation.

The GRBF kernel is a PDS kernel. It can be regarded as an embedding of
Euclidean distance in the form of a Gaussian function. The GRBF kernel
requires the time series to have the same length and cannot handle the
problem of time axis distortion. If the lengths of two time series differ,
re-sampling usually is required to normalize them to the same length
before further processing. Thus SVM with the GRBF kernel (GRBF-SVM)
usually is not suitable for time series classification. Motivated by the
effectiveness of elastic measures in handling the time axis distortion, it
is interesting to embed elastic distance into SVM-based time series
classification. Generally, there are two kinds of elastic distance. One is
the non-metric elastic distance measure, e.g. DTW, and the other is the
elastic metric, which is an elastic distance satisfying the triangle
inequality. Recently, DTW, one state-of-the-art elastic distance, has been
proposed to construct the GDTW kernel [19, 20]. Subsequent studies,
however, show that SVM with the GDTW kernel cannot consistently outperform
either GRBF-SVM or 1NN-DTW.

We assume that the poor performance of the GDTW kernel may be attributed
to the fact that DTW is non-metric, and suggest extending the GRBF kernel
using elastic metrics. Thus, we propose a novel class of kernel functions,
Gaussian elastic metric kernel (GEMK) functions.
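
The distinction between a non-metric elastic distance and an elastic metric can be illustrated with a minimal sketch. It assumes that the elastic kernel keeps the Gaussian form of GRBF with the Euclidean distance replaced by an elastic distance; that form, the function names, the absolute-difference local cost and the three toy series are assumptions for illustration and may differ from the exact definitions in [23].

```python
import math

def grbf_kernel(x, y, sigma):
    """Gaussian RBF kernel; only defined for equal-length series."""
    if len(x) != len(y):
        raise ValueError("GRBF needs equal-length series")
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def dtw(x, y):
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    d = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(x)][len(y)]

def erp(x, y, gap=0.0):
    """Edit distance with real penalty, using a constant gap value."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + abs(x[i - 1] - gap)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + abs(y[j - 1] - gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + abs(x[i - 1] - y[j - 1]),  # align x_i with y_j
                d[i - 1][j] + abs(x[i - 1] - gap),           # align x_i with a gap
                d[i][j - 1] + abs(y[j - 1] - gap),           # align y_j with a gap
            )
    return d[n][m]

def gaussian_elastic_kernel(x, y, sigma, dist=erp):
    """Assumed form: Gaussian of an elastic distance instead of the Euclidean one."""
    return math.exp(-dist(x, y) ** 2 / (2.0 * sigma ** 2))

a, b, c = [0, 0, 0], [0, 1], [1, 1, 1]
dtw_violates_triangle = dtw(a, c) > dtw(a, b) + dtw(b, c)
erp_respects_triangle = erp(a, c) <= erp(a, b) + erp(b, c)
print(dtw_violates_triangle, erp_respects_triangle)
print(gaussian_elastic_kernel([1, 2, 3], [1, 3], sigma=1.0))
```

On these three series DTW breaks the triangle inequality while ERP satisfies it, and the elastic kernel accepts series of different lengths, which `grbf_kernel` rejects. Both dynamic programs fill a full table, so their cost grows with the product of the two series lengths.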


5. Experiments and Results

In this section, we present experimental results of applying different
SVMs to image classification problems, and determine their efficiency in
handling indefinite similarity functions. As shown in Figure 1, when the
similarity function is PSD, the performance of Gaussian TWED SVM is
comparable to that of SVM. We used different data sets [1, 29-35] for
measuring the performance. When running statistical significance tests, we
find no statistically significant evidence that one method outperforms the
other at the 96.45% confidence level. The 1-norm SVM method achieves the
highest predictive accuracy among all methods that learn with indefinite
kernels, while also retaining sparsity of the support vector set, other
than GTWED SVM. Using the error rate as the performance indicator, we
compare the classification performance of Gaussian elastic metric kernel
SVM with other similarity measure methods, including the nearest neighbor
classifier with Euclidean distance (1NN-ED), the nearest neighbor
classifier with DTW (1NN-DTW), the nearest neighbor classifier with ODTW
(1NN-ODTW), the nearest neighbor classifier with ERP (1NN-ERP) and the
nearest neighbor classifier with OTWED (1NN-OTWED). Table 1 lists the
classification error rates of these methods on each data set. In our
experiments, GRBF-SVM takes the least time among all the above kernel
methods, because the complexity of the Euclidean distance in the GRBF
kernel is $O(n)$, while in GDTW, GERP and GTWED, the complexity of DTW,
ERP and TWED is $O(n^2)$. Besides, the numbers of support vectors of
GERP-SVM and GTWED-SVM, which are comparable to that of GDTW-SVM, are both
larger than that of GRBF-SVM. Thus, compared with GRBF-SVM, prediction
also takes more time for GERP-SVM, GTWED-SVM and GDTW-SVM [23].

Figure 1: COMPARATIVE STUDY USING THE DIFFERENT TIME SERIES DATA SETS: CLASSIFICATION ERROR RATES
(AVERAGE TEST ERROR RATE RESULTS) OBTAINED USING SIMILARITY MEASURE METHODS AND SVM CLASSIFIERS
WITH DIFFERENT KERNELS

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6. Conclusion 

Considerable research effort has recently been devoted to training support
vector machines (SVM) with indefinite kernels. In this paper, we examined
several kernel variants for SVM both theoretically and experimentally. We
compared the classification error rates (average test error rates)
obtained on different time series data sets using similarity measure
methods and SVM classifiers with different kernels. The 1-norm SVM method
formulates large-margin separation as a convex linear programming problem
without requiring that the kernel matrix be positive semidefinite. It uses
the indefinite similarity function directly without any transformation,
and, hence, it always treats both training and test examples consistently.
In addition, the Gaussian elastic metric kernel methods shown in the
figure achieve the highest accuracy among all methods that train SVM with
kernels, with statistical significance, while also retaining sparsity of
the support vector set. This important sparsity property ensures that the
1-norm SVM is able to delete many noise features by estimating their
coefficients as zero.
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